Get started with building a model in this R Markdown document, which accompanies the Evaluate your model with resampling article from the tidymodels Get Started series.

If you ever get lost, you can click on the header of the knitted document to see the accompanying section in the online article.

Take advantage of the RStudio IDE and use the “Run All Chunks Above” or “Run Current Chunk” buttons to execute code chunks easily.

Introduction

Load necessary packages:

library(tidymodels) # for the rsample package, along with the rest of tidymodels

# Helper packages
library(modeldata)  # for the cells data

The cell image data

Load cell image data:

data(cells, package = "modeldata")
cells
## # A tibble: 2,019 x 58
##   case  class angle_ch_1 area_ch_1 avg_inten_ch_1 avg_inten_ch_2 avg_inten_ch_3
##   <fct> <fct>      <dbl>     <int>          <dbl>          <dbl>          <dbl>
## 1 Test  PS        143.         185           15.7           4.95           9.55
## 2 Train PS        134.         819           31.9         207.            69.9 
## 3 Train WS        107.         431           28.0         116.            63.9 
## 4 Train PS         69.2        298           19.5         102.            28.2 
## 5 Test  PS          2.89       285           24.3         112.            20.5 
## # … with 2,014 more rows, and 51 more variables: avg_inten_ch_4 <dbl>,
## #   convex_hull_area_ratio_ch_1 <dbl>, convex_hull_perim_ratio_ch_1 <dbl>,
## #   diff_inten_density_ch_1 <dbl>, diff_inten_density_ch_3 <dbl>,
## #   diff_inten_density_ch_4 <dbl>, entropy_inten_ch_1 <dbl>,
## #   entropy_inten_ch_3 <dbl>, entropy_inten_ch_4 <dbl>,
## #   eq_circ_diam_ch_1 <dbl>, eq_ellipse_lwr_ch_1 <dbl>,
## #   eq_ellipse_oblate_vol_ch_1 <dbl>, eq_ellipse_prolate_vol_ch_1 <dbl>,
## #   eq_sphere_area_ch_1 <dbl>, eq_sphere_vol_ch_1 <dbl>,
## #   fiber_align_2_ch_3 <dbl>, fiber_align_2_ch_4 <dbl>,
## #   fiber_length_ch_1 <dbl>, fiber_width_ch_1 <dbl>, inten_cooc_asm_ch_3 <dbl>,
## #   inten_cooc_asm_ch_4 <dbl>, inten_cooc_contrast_ch_3 <dbl>,
## #   inten_cooc_contrast_ch_4 <dbl>, inten_cooc_entropy_ch_3 <dbl>,
## #   inten_cooc_entropy_ch_4 <dbl>, inten_cooc_max_ch_3 <dbl>,
## #   inten_cooc_max_ch_4 <dbl>, kurt_inten_ch_1 <dbl>, kurt_inten_ch_3 <dbl>,
## #   kurt_inten_ch_4 <dbl>, length_ch_1 <dbl>, neighbor_avg_dist_ch_1 <dbl>,
## #   neighbor_min_dist_ch_1 <dbl>, neighbor_var_dist_ch_1 <dbl>,
## #   perim_ch_1 <dbl>, shape_bfr_ch_1 <dbl>, shape_lwr_ch_1 <dbl>,
## #   shape_p_2_a_ch_1 <dbl>, skew_inten_ch_1 <dbl>, skew_inten_ch_3 <dbl>,
## #   skew_inten_ch_4 <dbl>, spot_fiber_count_ch_3 <int>,
## #   spot_fiber_count_ch_4 <dbl>, total_inten_ch_1 <int>,
## #   total_inten_ch_2 <dbl>, total_inten_ch_3 <int>, total_inten_ch_4 <int>,
## #   var_inten_ch_1 <dbl>, var_inten_ch_3 <dbl>, var_inten_ch_4 <dbl>,
## #   width_ch_1 <dbl>

Look at the proportion of each class:

cells %>% 
  count(class) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   class     n  prop
##   <fct> <int> <dbl>
## 1 PS     1300 0.644
## 2 WS      719 0.356

Data splitting

Define a split object stratified by the class column:

set.seed(123)
cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)

Apply the split to obtain the training (cell_train) and test (cell_test) sets.

cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

nrow(cell_train)
## [1] 1515
nrow(cell_train)/nrow(cells)
## [1] 0.7503715
# training set proportions by class
cell_train %>% 
  count(class) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   class     n  prop
##   <fct> <int> <dbl>
## 1 PS      975 0.644
## 2 WS      540 0.356
# test set proportions by class
cell_test %>% 
  count(class) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   class     n  prop
##   <fct> <int> <dbl>
## 1 PS      325 0.645
## 2 WS      179 0.355
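Notice that the training and test sets keep nearly identical class proportions: that is what `strata = class` buys us. Conceptually, a stratified split samples within each class rather than from the data as a whole. Here is a minimal base-R sketch of that idea, using a toy factor with the same imbalance as cells (this is an illustration, not rsample's actual algorithm):

```r
set.seed(123)

# Toy outcome with the same imbalance as cells: 1300 PS, 719 WS
cls <- factor(c(rep("PS", 1300), rep("WS", 719)))

# Sample 75% of the row indices *within* each class
train_idx <- unlist(lapply(split(seq_along(cls), cls), function(idx) {
  sample(idx, size = floor(0.75 * length(idx)))
}))

# Both pieces keep roughly the original 64/36 class mix
prop.table(table(cls[train_idx]))
prop.table(table(cls[-train_idx]))
```

Without stratification, an unlucky random split could leave the rarer WS class underrepresented in one of the sets.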

We will work with the training set data for the majority of the modeling steps.

Modeling

This time the data don’t need much preprocessing, so we will not use a recipe. Let’s create the model specification for a random forest model, using the ranger engine with the mode set to classification.

rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

See ?rand_forest for possible engines, modes, and further details.
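Engine-specific arguments can also be passed through set_engine(). As one hedged example (an aside for illustration; the rest of this document does not use it), we could ask ranger to compute impurity-based variable importance alongside the fit:

```r
library(parsnip)

# Same random forest specification, but with ranger's variable importance enabled
rf_imp_spec <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_imp_spec
```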

Now let’s fit the model to our training data using a formula:

set.seed(234)
rf_fit <- 
  rf_mod %>% 
  fit(class ~ ., data = cell_train)
rf_fit
## parsnip model object
## 
## Fit time:  2.5s 
## Ranger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, num.trees = ~1000,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      1515 
## Number of independent variables:  56 
## Mtry:                             7 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1218873
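The OOB error that ranger reports for a probability forest is a Brier score: the mean squared difference between the predicted class probability and the 0/1 outcome. A quick base-R illustration with made-up numbers:

```r
# Hypothetical predicted probabilities of one class and the matching 0/1 outcomes
p <- c(0.90, 0.20, 0.70, 0.40)
y <- c(1,    0,    1,    0)

brier <- mean((p - y)^2)
brier  # lower is better; 0 would mean perfect probability estimates
```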

Estimating performance

What happens if we evaluate model performance with the training data?

Predict with the fitted model on the training set:

rf_training_pred <- 
  predict(rf_fit, cell_train) %>% 
  bind_cols(predict(rf_fit, cell_train, type = "prob")) %>% 
  # Add the true outcome data back in
  bind_cols(cell_train %>% 
              select(class))

Look at the performance results:

rf_training_pred %>%                # training set predictions
  roc_auc(truth = class, .pred_PS)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary          1.00
rf_training_pred %>%                # training set predictions
  accuracy(truth = class, .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.993
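As a base-R sanity check of what accuracy() measures: it is simply the proportion of predictions that match the truth, i.e. the diagonal of the confusion matrix over its total (toy vectors below, for illustration only):

```r
truth <- factor(c("PS", "PS", "WS", "WS", "PS"))
pred  <- factor(c("PS", "WS", "WS", "WS", "PS"))

# Diagonal of the confusion matrix counts the correct predictions
cm <- table(pred, truth)
sum(diag(cm)) / sum(cm)
```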

Now proceed to the test set:

rf_testing_pred <- 
  predict(rf_fit, cell_test) %>% 
  bind_cols(predict(rf_fit, cell_test, type = "prob")) %>% 
  bind_cols(cell_test %>% select(class))

And look at the performance results from predicting on the test set:

rf_testing_pred %>%                   # test set predictions
  roc_auc(truth = class, .pred_PS)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.909
rf_testing_pred %>%                   # test set predictions
  accuracy(truth = class, .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.837

Whoops! Our performance results with the training set were a little too good to be true!
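This optimism is typical of flexible models re-predicting their own training data. A base-R sketch of the extreme case: a 1-nearest-neighbor “model” fit to pure noise (toy data, unrelated to cells) is perfect on its training rows and near chance on held-out rows:

```r
set.seed(123)

# Pure-noise toy data: any apparent skill must come from memorization
n <- 200
x <- matrix(rnorm(n * 5), ncol = 5)
y <- factor(sample(c("PS", "WS"), n, replace = TRUE))
train_idx <- 1:150

# A 1-nearest-neighbor predictor memorizes its training data
predict_1nn <- function(train_x, train_y, new_x) {
  apply(new_x, 1, function(row) {
    d2 <- rowSums(sweep(train_x, 2, row)^2)  # squared distance to each training row
    as.character(train_y[which.min(d2)])
  })
}

train_acc <- mean(predict_1nn(x[train_idx, ], y[train_idx], x[train_idx, ]) == y[train_idx])
test_acc  <- mean(predict_1nn(x[train_idx, ], y[train_idx], x[-train_idx, ]) == y[-train_idx])

train_acc  # 1: every training row is its own nearest neighbor
test_acc   # near 0.5: there was never any signal to learn
```

Resampling gives us honest performance estimates without touching the test set.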

Fit a model with resampling

Create cross-validation (CV) folds (for 10-fold CV) using vfold_cv() from the rsample package:

set.seed(345)
folds <- vfold_cv(cell_train, v = 10)
folds
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <named list>       <chr> 
##  1 <split [1.4K/152]> Fold01
##  2 <split [1.4K/152]> Fold02
##  3 <split [1.4K/152]> Fold03
##  4 <split [1.4K/152]> Fold04
##  5 <split [1.4K/152]> Fold05
##  6 <split [1.4K/151]> Fold06
##  7 <split [1.4K/151]> Fold07
##  8 <split [1.4K/151]> Fold08
##  9 <split [1.4K/151]> Fold09
## 10 <split [1.4K/151]> Fold10
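Conceptually (setting aside rsample extras such as stratification and repeats), 10-fold CV just shuffles the rows into 10 nearly equal groups, and each group takes one turn as the assessment set. A base-R sketch of the fold assignment:

```r
set.seed(345)
n <- 1515  # rows in cell_train

# Assign each row to one of 10 folds of nearly equal size
fold_id <- sample(rep(1:10, length.out = n))
table(fold_id)  # five folds of 152 rows and five of 151
```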

Do you recall working with workflow() from the second article, Preprocess your data with recipes?

Use a workflow() that bundles together the random forest model and a formula.

rf_wf <- 
  workflow() %>%
  add_model(rf_mod) %>%
  add_formula(class ~ .)

Now apply the workflow and fit the model to each fold:

set.seed(456)
rf_fit_rs <- 
  rf_wf %>% 
  fit_resamples(folds)

rf_fit_rs
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     .metrics         .notes          
##    <list>             <chr>  <list>           <list>          
##  1 <split [1.4K/152]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
##  2 <split [1.4K/152]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
##  3 <split [1.4K/152]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
##  4 <split [1.4K/152]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
##  5 <split [1.4K/152]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
##  6 <split [1.4K/151]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
##  7 <split [1.4K/151]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
##  8 <split [1.4K/151]> Fold08 <tibble [2 × 3]> <tibble [0 × 1]>
##  9 <split [1.4K/151]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
## 10 <split [1.4K/151]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>

Do you see the added columns .metrics and .notes?

Collect and summarize performance metrics from all 10 folds:

collect_metrics(rf_fit_rs)
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.833    10 0.0111 
## 2 roc_auc  binary     0.903    10 0.00842
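collect_metrics() averages each metric over the 10 folds and reports the standard error of that mean. With hypothetical per-fold accuracies (made up purely to illustrate the arithmetic):

```r
# Hypothetical per-fold accuracy values for 10 folds
fold_acc <- c(0.84, 0.82, 0.85, 0.81, 0.83, 0.84, 0.80, 0.86, 0.83, 0.85)

mean(fold_acc)                         # the "mean" column
sd(fold_acc) / sqrt(length(fold_acc))  # the "std_err" column
```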

Now these are more realistic results!